A heuristic method based on a statistical approach for Chinese text segmentation

نویسندگان

  • Christopher C. Yang
  • Kar Wing Li
چکیده

The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required in this method. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Due to the lack of delimiters of words in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with the problem of word segmentation have focused on the resolution of segmentation ambiguities. The problem of unknown word identification has not drawn much attention. The experimental result shows that the proposed heuristic method is promising to segment the unknown words as well as the known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked the proposed heuristic method with a previous proposed technique, boundary detection. It is found that the heuristic method outperformed the boundary detection method.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Segmenting Chinese Unknown Words by Heuristic Method

Chinese text segmentation is important in Chinese text indexing. Due to the lack of word delimiters in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, the segmentation ambiguities and the occurrences of out-of-vocabulary words (i.e. unknown words) are the major challenges in Chinese segmentation. Many research works dealing with the problem of ...

متن کامل

A Heuristic Method for Chinese Segmentation

Research and development in digital library includes content creation, conversion, indexing, organization, and dissemination, where the key technological issues are how to search and display desired selections from and across large collections effectively [10]. A repository is an indexed collection of objects. Indexing is an important task for searching. The better the indexing, the better the ...

متن کامل

Design of CKIP Chinese Word Segmentation System

In this paper, we describe the design of the CKIP Chinese word segmentation system and analyse its performance. The system utilizes a modulized approach. Independent modules were designed to solve the problems of segmentation ambiguities and identifying unknown words. Segmentation ambiguities are resolved by a hybrid method of using heuristic and statistical rules. Regular-type unknown words ar...

متن کامل

Realignment from Finer-grained Alignment to Coarser-grained Alignment to Enhance Mongolian-Chinese SMT

The conventional Mongolian-Chinese statistical machine translation (SMT) model uses Mongolian words and Chinese words to practice the system. However, data sparsity, complex Mongolian morphology and Chinese word segmentation (CWS) errors lead to alignment errors and ambiguities. Some other works use finer-grained Mongolian stems and Chinese characters, which suffer from information loss when in...

متن کامل

Sequence segmentation for statistical machine translation

In the last decade, while statistical machine translation has advanced significantly, there is still much room for further improvements relating to many natural language processing tasks such as word segmentation, word alignment and parsing. Human language is composed of sequences of meaningful units. These sequences can be words, phrases, sentences or even articles serving as basic elements in...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • JASIST

دوره 56  شماره 

صفحات  -

تاریخ انتشار 2005